Empirical Term Weighting and Expansion Frequency

Authors

  • Kyoji Umemura
  • Kenneth Ward Church
Abstract

We propose an empirical method for estimating term weights directly from relevance judgements, avoiding various standard but potentially troublesome assumptions. It is common to assume, for example, that weights vary with term frequency (tf) and inverse document frequency (idf) in a particular way, e.g., tf · idf, but the fact that there are so many variants of this formula in the literature suggests that there remains considerable uncertainty about these assumptions. Our method is similar to the Berkeley regression method, where labeled relevance judgements are fit as a linear combination of (transforms of) tf, idf, etc. Training methods not only improve performance, but also extend naturally to include additional factors such as burstiness and query expansion. The proposed histogram-based training method provides a simple way to model complicated interactions among factors such as tf, idf, burstiness and expansion frequency (a generalization of query expansion). Expanded terms are handled correctly on the basis of statistical information. Expansion frequency dramatically improves performance from a level comparable to BKJJBIDS, Berkeley's entry in the Japanese NACSIS NTCIR-1 evaluation for short queries, to the level of JCB1, the top system in the evaluation. JCB1 uses sophisticated (and proprietary) natural language processing techniques developed by Just System, a leader in the Japanese word-processing industry. We are encouraged that the proposed method, which is simple to understand and replicate, can reach this level of performance.

1 Introduction

An empirical method for estimating term weights directly from relevance judgements is proposed. The method is designed to make as few assumptions as possible. It is similar to Berkeley's use of regression (Cooper et al., 1994; Chen et al., 1999), where labeled relevance judgements are fit as a linear combination of (transforms of) tf, idf, etc., but avoids potentially troublesome assumptions by introducing histogram methods. Terms are grouped into bins. Weights are computed based on the number of relevant and irrelevant documents associated with each bin. The result ...

  • t: a term
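
The histogram idea described in the introduction (group terms into bins, then derive a weight per bin from counts of relevant and irrelevant documents) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `bin_key` and `train_bin_weights`, the choice to bin by capped tf and integer-floored idf, the additive smoothing constant, and the use of a log ratio of relevant to irrelevant proportions as the weight.

```python
import math
from collections import Counter


def bin_key(tf, idf, max_tf=3):
    """Hypothetical binning: cap tf at max_tf and floor idf to an integer."""
    return (min(tf, max_tf), int(math.floor(idf)))


def train_bin_weights(observations, smoothing=0.5):
    """Estimate one weight per bin from labeled relevance judgements.

    observations: iterable of (tf, idf, is_relevant) triples, one per
    judged (query term, document) pair. The weight of a bin is the log2
    ratio of its smoothed share among relevant observations to its
    smoothed share among irrelevant ones (an assumed form).
    """
    relevant = Counter()
    irrelevant = Counter()
    for tf, idf, is_relevant in observations:
        key = bin_key(tf, idf)
        if is_relevant:
            relevant[key] += 1
        else:
            irrelevant[key] += 1

    bins = set(relevant) | set(irrelevant)
    n_rel = sum(relevant.values())
    n_irrel = sum(irrelevant.values())
    weights = {}
    for key in bins:
        # Additive smoothing keeps empty cells from producing log(0).
        p_rel = (relevant[key] + smoothing) / (n_rel + smoothing * len(bins))
        p_irrel = (irrelevant[key] + smoothing) / (n_irrel + smoothing * len(bins))
        weights[key] = math.log2(p_rel / p_irrel)
    return weights


# Toy judgements: tf=2 mostly co-occurs with relevant documents,
# tf=0 mostly with irrelevant ones (idf falls in the same bucket).
toy = (
    [(2, 3.2, True)] * 3
    + [(2, 3.7, False)]
    + [(0, 3.1, True)]
    + [(0, 3.5, False)] * 3
)
weights = train_bin_weights(toy)
```

In this toy data the bin for tf = 2 receives a positive weight and the bin for tf = 0 a negative one. Further factors named in the abstract, such as burstiness or expansion frequency, could in principle be added as extra components of the bin key, though how the authors actually bin them is not stated in the text available here.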


Similar resources

Term Weighting in Short Documents for Document Categorization, Keyword Extraction and Query Expansion

This thesis focuses on term weighting in short documents. I propose weighting approaches for assessing the importance of terms for three tasks: (1) document categorization, which aims to classify documents such as tweets into categories, (2) keyword extraction, which aims to identify and extract the most important words of a document, and (3) keyword association modeling, which aims to identify...


Developing a model for simulating urban expansion based on the concept of decision risk: A case study in Babol city

Today, the study of the spatial-temporal pattern of urban physical expansion and the identification of the parameters affecting the expansion play a crucial role in urban-related decision-making and long-term planning processes. Consequently, the use of precise and efficient methods to predict the physical expansion of urban areas is of great importance. The objective of present study is to pro...


Biomedical Text Mining about Alzheimer's Diseases for Machine Reading Evaluation

The paper presents the experiments carried out as part of the participation in the pilot task of Biomedical about Alzheimer for QA4MRE at CLEF 2012. We submitted a total of five unique runs in the pilot task. One run uses Term Frequency (TF) of the query words to weight the sentences. Two runs use Term Frequency-Inverted Document Frequency (TF-IDF) of the query words to weight the sentences. The...


Simple Weighting Techniques for Query Expansion in Biomedical Document Retrieval

In this paper, we propose two weighting techniques to improve performances of query expansion in biomedical document retrieval, especially when a short biomedical term in a query is expanded with its synonymous multi-word terms. When a query contains synonymous terms of different lengths, a traditional IR model highly ranks a document containing a longer terminology because a longer terminology...


YFilter at TREC-9

We built a filtering system YFILTER this year, which we used for experiments on profile updating and thresholds setting. Our focus is using incremental Rocchio for introducing new query terms and term weighting. Although 1, 0.5, 0.25 is a widely used Rocchio ratio for query expansion based on relevance feedback, we found that the optimal setting for information filtering is corpus and profile d...




Publication date: 2000